2025 iThome 鐵人賽 — DAY 24 (DevOps)

Series: Learning Ansible in 30 Days Without Burning Out — Part 24

Day 24 - Monitor Well, Sleep Well at Night

Today's Goals

  • Understand why monitoring matters in DevOps
  • Learn to deploy a Prometheus + Grafana monitoring stack with Ansible
  • Build an ELK Stack log collection pipeline
  • Set up a complete monitoring and alerting mechanism

Why Do We Need Monitoring and Logs?

Once a service goes live, the scariest thing is not knowing what is going on inside the system.
Picture these situations:

  • Users complain the site is slow, but you have no idea where the problem is
  • A service dies in the middle of the night, and you only notice the next morning
  • You want to scale out, but you don't know where the bottleneck is

That is why we need the three pillars of Observability:

  1. Metrics - the system expressed as numbers
  2. Logs - the system's detailed records
  3. Traces - the path a request takes through the system

Today we focus on the first two!

Choosing the Monitoring Stack

The Prometheus + Grafana Combo

This is currently the most popular open-source monitoring solution:

Prometheus

  • Time-series database
  • Pull-based metric collection
  • Powerful query language (PromQL)
  • Built-in alerting

Grafana

  • Dashboard visualization tool
  • Supports many data sources (a provisioning sketch follows this list)
  • Rich set of chart types
  • Clean, attractive UI
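
To make the pairing concrete, Grafana can pick up Prometheus as a data source through its file-based provisioning mechanism. A minimal sketch, assuming Prometheus listens on localhost:9090 (the path below is Grafana's standard provisioning directory):

# /etc/grafana/provisioning/datasources/prometheus.yml (sketch)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true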

ELK Stack for Log Processing

Elasticsearch + Logstash + Kibana

  • Distributed search engine
  • Powerful log processing pipeline (a Filebeat shipper sketch follows this list)
  • Flexible querying and analysis
  • Real-time log monitoring
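
In practice the logs are usually shipped into Logstash by a lightweight agent such as Filebeat, which also appears later in the deployment playbook. A minimal shipper config sketch, assuming Logstash listens on the Beats port 5044 configured further down:

# /etc/filebeat/filebeat.yml (minimal sketch)
filebeat.inputs:
  - type: filestream
    id: nginx-access
    paths:
      - /var/log/nginx/access.log

output.logstash:
  hosts: ["localhost:5044"]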

Deploying Prometheus + Grafana with Ansible

1. Create the Monitoring Role Structure

mkdir -p roles/monitoring/{tasks,templates,defaults,handlers}

2. Define Default Variables

---
# roles/monitoring/defaults/main.yml
prometheus_version: "3.6.0"
grafana_version: "12.2.0"

# Prometheus configuration
prometheus_port: 9090
prometheus_data_dir: /opt/prometheus/data
prometheus_config_dir: /etc/prometheus

# Grafana configuration
grafana_port: 3000
grafana_data_dir: /var/lib/grafana
grafana_admin_password: "admin123"

# Monitoring scrape targets
monitoring_targets:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'web-servers'
    static_configs:
      - targets: "{{ groups['web'] | map('extract', hostvars, 'ansible_default_ipv4') | map(attribute='address') | map('regex_replace', '^(.*)$', '\\1:9100') | list }}"
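
The web-servers job builds its target list from the facts of every host in the web group, so it assumes an inventory that defines such a group and that facts have been gathered for those hosts. Roughly this kind of inventory is expected (host names and addresses are purely illustrative):

# inventory/hosts.yml (illustrative)
all:
  children:
    monitoring:
      hosts:
        monitor1:
          ansible_host: 192.168.56.10
    web:
      hosts:
        web1:
          ansible_host: 192.168.56.11
        web2:
          ansible_host: 192.168.56.12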

3. Main Task Orchestration

---
# roles/monitoring/tasks/main.yml
- name: Create monitoring users and directories
  block:
    - name: Create prometheus user
      user:
        name: prometheus
        system: yes
        shell: /bin/false
        home: "{{ prometheus_data_dir }}"
        create_home: no

    - name: Create prometheus directories
      file:
        path: "{{ item }}"
        state: directory
        owner: prometheus
        group: prometheus
        mode: '0755'
      loop:
        - "{{ prometheus_config_dir }}"
        - "{{ prometheus_data_dir }}"
        - /opt/prometheus

- name: Download and install Prometheus
  unarchive:
    src: "https://github.com/prometheus/prometheus/releases/download/v{{ prometheus_version }}/prometheus-{{ prometheus_version }}.linux-amd64.tar.gz"
    dest: /opt
    remote_src: yes
    owner: prometheus
    group: prometheus
    creates: "/opt/prometheus-{{ prometheus_version }}.linux-amd64"
  register: prometheus_download

# state=link is idempotent, so no 'changed' guard on the download task is needed
- name: Create prometheus symlink
  file:
    src: "/opt/prometheus-{{ prometheus_version }}.linux-amd64"
    dest: /opt/prometheus/current
    state: link
    owner: prometheus
    group: prometheus

- name: Deploy prometheus configuration
  template:
    src: prometheus.yml.j2
    dest: "{{ prometheus_config_dir }}/prometheus.yml"
    owner: prometheus
    group: prometheus
    mode: '0644'
  notify: restart prometheus

- name: Deploy prometheus systemd service
  template:
    src: prometheus.service.j2
    dest: /etc/systemd/system/prometheus.service
    mode: '0644'
  notify:
    - reload systemd
    - restart prometheus

- name: Install Node Exporter
  include_tasks: node_exporter.yml

- name: Install and configure Grafana
  include_tasks: grafana.yml

- name: Start and enable monitoring services
  systemd:
    name: "{{ item }}"
    state: started
    enabled: yes
    daemon_reload: yes
  loop:
    - prometheus
    - node-exporter
    - grafana-server

4. Install Node Exporter

---
# roles/monitoring/tasks/node_exporter.yml
- name: Download Node Exporter
  unarchive:
    src: "https://github.com/prometheus/node_exporter/releases/download/v1.9.1/node_exporter-1.9.1.linux-amd64.tar.gz"
    dest: /opt
    remote_src: yes
    owner: prometheus
    group: prometheus
    creates: "/opt/node_exporter-1.9.1.linux-amd64"

- name: Create node-exporter symlink
  file:
    src: "/opt/node_exporter-1.9.1.linux-amd64"
    dest: /opt/node-exporter
    state: link

- name: Deploy node-exporter systemd service
  template:
    src: node-exporter.service.j2
    dest: /etc/systemd/system/node-exporter.service
  notify: reload systemd

5. Install and Configure Grafana

---
# roles/monitoring/tasks/grafana.yml
- name: Add Grafana repository key
  apt_key:
    url: https://packages.grafana.com/gpg.key
    state: present
  when: ansible_os_family == "Debian"

- name: Add Grafana repository
  apt_repository:
    repo: "deb https://packages.grafana.com/oss/deb stable main"
    state: present
  when: ansible_os_family == "Debian"

- name: Install Grafana
  package:
    name: grafana
    state: present

- name: Configure Grafana
  template:
    src: grafana.ini.j2
    dest: /etc/grafana/grafana.ini
    backup: yes
  notify: restart grafana

- name: Import Grafana dashboards
  uri:
    url: "http://localhost:{{ grafana_port }}/api/dashboards/db"
    method: POST
    headers:
      Content-Type: "application/json"
      Authorization: "Basic {{ ('admin:' + grafana_admin_password) | b64encode }}"
    body_format: json
    body:
      dashboard:
        id: null
        title: "Node Exporter Full"
        tags: ["prometheus", "node-exporter"]
        timezone: "browser"
        panels: []
        time:
          from: "now-6h"
          to: "now"
        refresh: "30s"
      folderId: 0
      overwrite: true
  register: dashboard_import
  until: dashboard_import.status == 200
  retries: 5
  delay: 10
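
The dashboard import assumes Grafana already knows about Prometheus as a data source. One way to handle that in the same task file is to drop a provisioning file (like the sketch shown earlier) into Grafana's provisioning directory; datasource.yml.j2 here is a hypothetical template, not one of the files listed in this article:

- name: Provision Prometheus datasource for Grafana
  template:
    src: datasource.yml.j2    # hypothetical template containing the provisioning sketch
    dest: /etc/grafana/provisioning/datasources/prometheus.yml
    owner: root
    group: grafana
    mode: '0640'
  notify: restart grafana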

6. Configuration Templates

Prometheus Configuration Template

# roles/monitoring/templates/prometheus.yml.j2
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

rule_files:
  - "/etc/prometheus/rules/*.yml"

scrape_configs:
{% for target in monitoring_targets %}
  - job_name: '{{ target.job_name }}'
    static_configs:
{% for config in target.static_configs %}
      - targets: {{ config.targets | to_json }}
{% endfor %}
{% endfor %}

  # Auto-discover Docker containers
  - job_name: 'docker'
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
    relabel_configs:
      - source_labels: [__meta_docker_container_name]
        target_label: container_name
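
For reference, with the default monitoring_targets and the illustrative inventory shown earlier, the rendered scrape_configs would come out roughly like this:

# Rendered prometheus.yml (excerpt, illustrative addresses)
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: 'node-exporter'
    static_configs:
      - targets: ["localhost:9100"]
  - job_name: 'web-servers'
    static_configs:
      - targets: ["192.168.56.11:9100", "192.168.56.12:9100"]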

Systemd Service Template

# roles/monitoring/templates/prometheus.service.j2
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/opt/prometheus/current/prometheus \
    --config.file={{ prometheus_config_dir }}/prometheus.yml \
    --storage.tsdb.path={{ prometheus_data_dir }} \
    --web.console.templates=/opt/prometheus/current/consoles \
    --web.console.libraries=/opt/prometheus/current/console_libraries \
    --web.listen-address=0.0.0.0:{{ prometheus_port }} \
    --web.external-url=http://{{ ansible_default_ipv4.address }}:{{ prometheus_port }}

[Install]
WantedBy=multi-user.target

7. Handler Definitions

---
# roles/monitoring/handlers/main.yml
- name: reload systemd
  systemd:
    daemon_reload: yes

- name: restart prometheus
  systemd:
    name: prometheus
    state: restarted

- name: restart grafana
  systemd:
    name: grafana-server
    state: restarted

# Needed by the AlertManager tasks further down
- name: restart alertmanager
  systemd:
    name: alertmanager
    state: restarted

Deploying the ELK Stack

1. Create the ELK Role

---
# roles/elk/tasks/main.yml
- name: Install Java
  # Elasticsearch and Logstash 8.x ship a bundled JDK, so this is mainly a
  # safety net for other tooling on the host
  package:
    name: openjdk-11-jdk
    state: present

- name: Add Elastic repository
  block:
    - name: Add Elasticsearch signing key
      apt_key:
        url: https://artifacts.elastic.co/GPG-KEY-elasticsearch
        state: present

    - name: Add Elastic repository
      apt_repository:
        repo: "deb https://artifacts.elastic.co/packages/8.x/apt stable main"
        state: present

- name: Install Elasticsearch
  package:
    name: elasticsearch
    state: present
  notify: restart elasticsearch

- name: Configure Elasticsearch
  template:
    src: elasticsearch.yml.j2
    dest: /etc/elasticsearch/elasticsearch.yml
    backup: yes
  notify: restart elasticsearch

- name: Install and configure Kibana
  include_tasks: kibana.yml

- name: Install and configure Logstash
  include_tasks: logstash.yml

- name: Start and enable ELK services
  systemd:
    name: "{{ item }}"
    state: started
    enabled: yes
    daemon_reload: yes
  loop:
    - elasticsearch
    - kibana
    - logstash
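
The elasticsearch.yml.j2 template referenced above is not shown in this article; a minimal single-node sketch that would fit this kind of lab setup (the values are assumptions, and security is switched off purely to keep the lab simple, so don't reuse this in production):

# roles/elk/templates/elasticsearch.yml.j2 (sketch)
cluster.name: demo-logs
node.name: {{ ansible_hostname }}
network.host: 0.0.0.0
http.port: 9200
discovery.type: single-node
# Disabled only for this lab; enable security for anything real
xpack.security.enabled: false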

2. Logstash Configuration Example

# roles/elk/templates/logstash.conf.j2
input {
  beats {
    port => 5044
  }

  # Collect system logs
  file {
    path => "/var/log/syslog"
    type => "syslog"
    start_position => "beginning"
  }

  # Collect Nginx logs
  file {
    path => "/var/log/nginx/access.log"
    type => "nginx_access"
    start_position => "beginning"
  }
}

filter {
  if [type] == "nginx_access" {
    grok {
      # NGINXACCESS is not one of grok's built-in patterns; define it in a
      # custom patterns file, or switch to the built-in HTTPD_COMBINEDLOG
      match => {
        "message" => "%{NGINXACCESS}"
      }
    }

    date {
      match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    }

    mutate {
      convert => {
        "response_code" => "integer"
        "bytes" => "integer"
      }
    }
  }

  if [type] == "syslog" {
    grok {
      match => {
        "message" => "%{SYSLOGTIMESTAMP:timestamp} %{IPORHOST:host} %{PROG:program}: %{GREEDYDATA:message}"
      }
    }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "%{type}-%{+YYYY.MM.dd}"
  }

  # Debug output
  stdout {
    codec => rubydebug
  }
}
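
The logstash.yml task file included from main.yml is not shown either; a minimal sketch of what it could contain (the restart logstash handler is assumed to live in the elk role's handlers, which this article does not show):

# roles/elk/tasks/logstash.yml (sketch)
- name: Install Logstash
  package:
    name: logstash
    state: present

- name: Deploy Logstash pipeline configuration
  template:
    src: logstash.conf.j2
    dest: /etc/logstash/conf.d/logstash.conf
    owner: logstash
    group: logstash
    mode: '0644'
  notify: restart logstash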

Alerting Setup

1. AlertManager Configuration

---
# roles/monitoring/tasks/alertmanager.yml
- name: Download and install AlertManager
  unarchive:
    src: "https://github.com/prometheus/alertmanager/releases/download/v0.28.1/alertmanager-0.28.1.linux-amd64.tar.gz"
    dest: /opt
    remote_src: yes
    owner: prometheus
    group: prometheus
    creates: "/opt/alertmanager-0.28.1.linux-amd64"

- name: Configure AlertManager
  template:
    src: alertmanager.yml.j2
    dest: /etc/prometheus/alertmanager.yml
    owner: prometheus
    group: prometheus
  notify: restart alertmanager

- name: Deploy alert rules
  template:
    src: alert_rules.yml.j2
    dest: /etc/prometheus/rules/alert_rules.yml
    owner: prometheus
    group: prometheus
  notify: restart prometheus
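
The tasks above install and configure AlertManager but never create or start a service for it, and prometheus.yml points alerts at alertmanager:9093, which has to resolve (or be changed to localhost:9093). A sketch of the missing pieces, where alertmanager.service.j2 is a hypothetical template along the lines of prometheus.service.j2:

- name: Deploy AlertManager systemd service
  template:
    src: alertmanager.service.j2    # hypothetical template, similar to prometheus.service.j2
    dest: /etc/systemd/system/alertmanager.service
    mode: '0644'
  notify:
    - reload systemd
    - restart alertmanager

- name: Start and enable AlertManager
  systemd:
    name: alertmanager
    state: started
    enabled: yes
    daemon_reload: yes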

2. Alert Rule Examples

# roles/monitoring/templates/alert_rules.yml.j2
{# Wrapped in raw so Jinja2 leaves Prometheus's $labels placeholders untouched #}
{% raw %}
groups:
- name: system_alerts
  rules:

  # CPU usage too high
  - alert: HighCPUUsage
    expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage detected"
      description: "CPU usage is above 80% for more than 5 minutes on {{ $labels.instance }}"

  # Memory usage too high
  - alert: HighMemoryUsage
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High memory usage detected"
      description: "Memory usage is above 85% on {{ $labels.instance }}"

  # Disk space running low
  - alert: DiskSpaceLow
    expr: node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"} * 100 < 10
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Disk space running low"
      description: "Disk space is below 10% on {{ $labels.instance }} filesystem {{ $labels.mountpoint }}"

  # Service is down
  - alert: ServiceDown
    expr: up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Service is down"
      description: "Service {{ $labels.job }} on {{ $labels.instance }} is down"
{% endraw %}

3. Slack Notification Configuration

# roles/monitoring/templates/alertmanager.yml.j2
global:
  slack_api_url: '{{ slack_webhook_url }}'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'slack-notifications'

receivers:
- name: 'slack-notifications'
  slack_configs:
  - channel: '#alerts'
    username: 'Prometheus'
    icon_emoji: ':fire:'
    # The raw tags keep Jinja2 from evaluating AlertManager's own Go templating
    title: 'Alert: {% raw %}{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}{% endraw %}'
    text: '{% raw %}{{ range .Alerts }}{{ .Annotations.description }}{{ end }}{% endraw %}'
    actions:
    - type: button
      text: 'View in Grafana'
      url: 'http://{{ grafana_url }}/d/node-exporter-full'

Hands-On Deployment Playbook

Main Deployment Playbook

---
# monitoring-deploy.yml
- name: Deploy complete monitoring stack
  hosts: monitoring
  become: yes
  vars:
    slack_webhook_url: "{{ vault_slack_webhook }}"
    grafana_admin_password: "{{ vault_grafana_password }}"

  roles:
    - monitoring
    - elk

  post_tasks:
    - name: Wait for services to be ready
      uri:
        url: "http://{{ ansible_default_ipv4.address }}:{{ item.port }}{{ item.path }}"
        method: GET
        status_code: 200
      register: health_check
      until: health_check.status == 200
      retries: 30
      delay: 10
      loop:
        - { port: 9090, path: "/-/healthy" }    # Prometheus health endpoint
        - { port: 3000, path: "/api/health" }   # Grafana health endpoint

    - name: Display access URLs
      debug:
        msg:
          - "Prometheus: http://{{ ansible_default_ipv4.address }}:9090"
          - "Grafana: http://{{ ansible_default_ipv4.address }}:3000 (admin/{{ grafana_admin_password }})"
          - "Kibana: http://{{ ansible_default_ipv4.address }}:5601"

# Deploy Node Exporter to the application servers
- name: Deploy monitoring agents
  hosts: web,db
  become: yes
  tasks:
    - name: Install Node Exporter
      include_role:
        name: monitoring
        tasks_from: node_exporter.yml

    - name: Configure Filebeat for log shipping
      include_role:
        name: elk
        tasks_from: filebeat.yml
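
The filebeat.yml task file pulled in here is another piece the article does not show; a minimal sketch (filebeat.yml.j2 and the restart filebeat handler are assumptions, and the Elastic apt repository from the elk role is expected to be present on these hosts as well):

# roles/elk/tasks/filebeat.yml (sketch)
- name: Install Filebeat
  package:
    name: filebeat
    state: present

- name: Deploy Filebeat configuration
  template:
    src: filebeat.yml.j2    # hypothetical template, e.g. the shipper sketch shown earlier
    dest: /etc/filebeat/filebeat.yml
    mode: '0600'
  notify: restart filebeat

- name: Start and enable Filebeat
  systemd:
    name: filebeat
    state: started
    enabled: yes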

Dashboard Design Best Practices

1. System Overview Dashboard

Key metrics to include:

  • The four golden signals: latency, traffic, errors, saturation
  • System resources: CPU, memory, disk, network
  • Application metrics: response time, throughput, error rate
  • Business metrics: active users, transaction volume, conversion rate

2. Useful Grafana Panel Queries

# CPU usage
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100

# Disk I/O
rate(node_disk_reads_completed_total[5m])
rate(node_disk_writes_completed_total[5m])

# Network traffic
rate(node_network_receive_bytes_total{device!="lo"}[5m])
rate(node_network_transmit_bytes_total{device!="lo"}[5m])

# HTTP request rate (requires the application to expose metrics)
rate(http_requests_total[5m])

# HTTP error rate
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100

Practice Time

Exercise 1: Basic Monitoring Deployment

  1. Use Ansible to deploy Prometheus + Grafana to one server
  2. Install Node Exporter on the other servers
  3. Verify that system metrics are being collected
  4. Import a ready-made Node Exporter dashboard

Exercise 2: Custom Alert Rules

  1. Write alert rules tailored to your application
  2. Set up Slack notifications
  3. Test that alerts actually fire
  4. Tune thresholds to avoid false positives

Exercise 3: Log Collection

  1. Deploy the ELK Stack
  2. Configure Filebeat to ship application logs
  3. Build a useful dashboard in Kibana
  4. Set up log-based alert rules

Exercise 4: Application Monitoring

  1. Add a metrics endpoint to your application
  2. Configure Prometheus to scrape the application metrics
  3. Build a dashboard dedicated to the application
  4. Set up SLA-related alerts

Monitoring Best Practices

1. Monitoring Strategy

  • The monitoring pyramid: infrastructure → application → business
  • The four golden signals: latency, traffic, errors, saturation
  • The RED method: Rate, Errors, Duration
  • The USE method: Utilization, Saturation, Errors

2. Alert Design Principles

  • Actionable alerts: every alert should come with clear response steps
  • Avoid alert fatigue: set sensible thresholds and inhibition rules
  • Tiered alerting: different severities go through different notification channels (see the routing sketch after this list)
  • Alert deduplication: the same class of problem should only fire one alert
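
As one example of tiered alerting, AlertManager routes can branch on the severity label that the alert rules above already set. A sketch (the pagerduty-oncall receiver is hypothetical and would need its own entry under receivers):

# alertmanager.yml.j2 routing sketch
route:
  receiver: 'slack-notifications'
  group_by: ['alertname']
  routes:
    - matchers:
        - severity="critical"
      receiver: 'pagerduty-oncall'     # page someone for critical alerts
      repeat_interval: 30m
    - matchers:
        - severity="warning"
      receiver: 'slack-notifications'  # warnings only go to Slack
      repeat_interval: 4h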

3. Dashboard Design Tips

  • User-centric: different roles get different dashboards
  • Watch trends: not just current values, but how they change over time
  • Keep context together: show related metrics on the same dashboard
  • Allow drill-down: a hierarchy from overview to detail

4. Performance Tuning Tips

  • Appropriate scrape intervals: you rarely need to scrape every second
  • Sensible retention policy: downsample data for long-term storage
  • Label hygiene: avoid high-cardinality labels
  • Query optimization: use recording rules to pre-compute expensive queries (see the sketch after this list)
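
As a concrete example of the recording-rule tip, the CPU panel query shown earlier can be pre-computed and then queried by its recorded name. The rule file goes into the same /etc/prometheus/rules/ directory as the alert rules (the rule name below is just a naming-convention suggestion):

# /etc/prometheus/rules/recording_rules.yml (sketch)
groups:
  - name: node_recording_rules
    rules:
      - record: instance:node_cpu_utilisation:rate5m
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)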

Tomorrow's Preview

Tomorrow we look at container and cloud integration and learn how to manage Docker and cloud resources with Ansible!

